join = read.csv("guild-activation.csv")
join
leave = read.csv("guild-leavers.csv")
leave
source = read.csv("guild-joins-by-source.csv")
source
text = read.csv("popular-text-channels.csv")
text
voice_channel = read.csv("popular-voice-channels.csv")
voice_channel
message = read.csv("guild-message-activity.csv")
message
voice = read.csv("guild-voice-activity.csv")
voice
communicator = read.csv("guild-communicators.csv")
communicator
library(lubridate)
I grabbed the example below from astrostats.psu. The Berkeley Statistics page Dates and Times in R was a great reference for the datetime format codes and their values.
| Code | Value |
|---|---|
| %d | Day of the month (decimal number) |
| %m | Month (decimal number) |
| %b | Month (abbreviated) |
| %B | Month (full name) |
| %y | Year (2 digit) |
| %Y | Year (4 digit) |
## read in date/time info in format 'm/d/y h:m:s'
dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
x <- paste(dates, times)
strptime(x, "%m/%d/%y %H:%M:%S")
[1] "1992-02-27 23:03:20 CST" "1992-02-27 22:29:56 CST" "1992-01-14 01:03:30 CST" "1992-02-28 18:21:03 CST" "1992-02-01 16:56:26 CST"
strptime(x, "%m/")
[1] NA NA NA NA NA
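The format codes in the table work in both directions: strptime() and as.Date() use them to parse text, and format() uses them to print a date. Here is a minimal sketch that sticks to the numeric codes so the result does not depend on the locale. The variable names are illustrative and not part of the original notebook.

```r
# Parse one of the date strings from the example above (month/day/2-digit year)
parsed <- as.Date("02/27/92", format = "%m/%d/%y")

# Print it back using the numeric codes from the table
iso_date   <- format(parsed, "%Y-%m-%d")  # 4-digit year, month, day
day_only   <- format(parsed, "%d")        # day of the month
short_year <- format(parsed, "%y")        # 2-digit year

iso_date
```

The all-NA result for the "%m/" format above appears consistent with the strptime documentation, which notes that when a month is specified the day of the month has to be specified as well.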
These were scuffed tests I used to learn how to extract the datetime.
* The variable test made me realize that removing +00:00 and replacing it with a Z would put the data in a format that R can read.
* The variable test2 was my attempt to do the same for an entire column.
test = "2021-03-27T00:00:00Z"
str(ymd_hms(test))
POSIXct[1:1], format: "2021-03-27"
test2 = join$interval_start_timestamp
#test2
#ymd_hms(join$interval_start_timestamp)
#strptime(test2, "%Y-%m-%dT%H:%M:%SZ")
While performing my tests, I struggled to understand what format the date was in. A search for the 2021-03-27T00:00:00+00:00 format pointed me to a Stack Overflow page, Date Time Formats in Python, that helped me learn more about Python's datetime functions.
A search for "R remove all text after plus sign" helped me break through this barrier. I found the Stack Overflow answer How to remove + (plus sign) from string in R? particularly helpful for removing the + sign. gsub seemed to be the recommended choice among all the answers.
I also found a Stack Overflow answer, Remove all text before colon, that had an example of how to remove the rest of a string. I couldn’t remember how to remove everything after the +, so the example from stevencarlislewalker’s blog, Remove (or replace) everything before or after a specified character in R strings, was particularly helpful in refreshing my memory.
gsub("\\+.*", 'Z', "2021-03-27T00:00:00+00:00")
[1] "2021-03-27T00:00:00Z"
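gsub is vectorized, so the same call can clean a whole column at once. Here is a small sketch with made-up timestamps, followed by an explicit base R parse. The vector `stamps`, the format string, and the UTC time zone are my assumptions for illustration, not something taken from the CSV files.

```r
# Made-up timestamps in the same shape as the exported data
stamps <- c("2021-03-27T00:00:00+00:00", "2021-03-28T00:00:00+00:00")

# Replace everything from the plus sign onward with a literal Z
cleaned <- gsub("\\+.*", "Z", stamps)

# Parse with an explicit format; the T and Z are matched as literal characters
parsed <- as.POSIXct(cleaned, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")
```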
These were tests I ran to automate this for all the datetime rows.
#join[1,1] = gsub("\\+.*", 'Z', join[1,1])
#join
join[,1] = gsub("\\+.*", 'Z', join[,1])
join
Once I got it working on a row, I applied what I learned above to extract the year, month, and day of the week from the initial datetime object. Later, when I was generating the bar charts, I had issues ordering the data by calendar month. A quick search yielded Sorting months in R, where I learned that passing the months into factor with levels = month.name would allow me to sort by month.
year = year(as.POSIXlt(join$interval_start_timestamp))
month = factor(months(as.POSIXlt(join$interval_start_timestamp)),levels = month.name)
day = weekdays(as.POSIXlt(join$interval_start_timestamp))
After extracting the three vectors, I used cbind to append them as columns to the original dataset and then reordered the columns.
joins = cbind(join, year, month,day)
joins
joins = joins[,c(1,5,6,7,2,3,4)]
joins
# test to see what would happen if I could convert a months output as a factor
factor(months(as.POSIXlt(join$interval_start_timestamp)),levels = month.name)[1:20]
[1] March March March April April April April April April April April April April April April April April April April April
Levels: January February March April May June July August September October November December
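To see why the factor levels matter, here is a minimal sketch comparing a plain character sort with a factor that uses month.name as its levels. The month strings are typed in by hand so the example does not depend on the locale.

```r
month_names <- c("March", "January", "April", "January")

# Plain character vectors sort alphabetically
alphabetical <- sort(unique(month_names))

# A factor with levels = month.name sorts in calendar order
by_calendar <- factor(month_names, levels = month.name)
calendar_sorted <- as.character(sort(unique(by_calendar)))

alphabetical
calendar_sorted
```

Anything that respects factor levels, such as aggregate or a ggplot axis, then follows the calendar order as well.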
Run the following cells to extract the year, month, and day of the week for each dataset.
# substring replacement
join[,1] = gsub("\\+.*", 'Z', join[,1])
# individual extraction
year = factor(year(as.POSIXlt(join[,1])))
month = factor(months(as.POSIXlt(join[,1])),levels = month.name)
day = weekdays(as.POSIXlt(join[,1]))
# appending new individually extracted dates
joins = cbind(join, year, month,day)
joins = joins[,c(1,5,6,7,2,3,4)]
joins
# substring replacement
source[,1] = gsub("\\+.*", 'Z', source[,1])
# individual extraction
year = factor(year(as.POSIXlt(source[,1])))
month = factor(months(as.POSIXlt(source[,1])),levels = month.name)
day = weekdays(as.POSIXlt(source[,1]))
# appending new individually extracted dates
sources = cbind(source, year, month,day)
sources = sources[,c(1,5,6,7,2,3,4)]
sources
# substring replacement
leave[,1] = gsub("\\+.*", 'Z', leave[,1])
# individual extraction
year = factor(year(as.POSIXlt(leave[,1])))
month = factor(months(as.POSIXlt(leave[,1])),levels = month.name)
day = weekdays(as.POSIXlt(leave[,1]))
# appending new individually extracted dates
leave
leaves = cbind(leave, year, month,day)
leaves
leaves = leaves[,c(1,4,5,6,2,3)]
leaves
# substring replacement
message[,1] = gsub("\\+.*", 'Z', message[,1])
# individual extraction
year = factor(year(as.POSIXlt(message[,1])))
month = factor(months(as.POSIXlt(message[,1])),levels = month.name)
day = weekdays(as.POSIXlt(message[,1]))
# appending new individually extracted dates
messages = cbind(message, year, month,day)
messages
messages = messages[,c(1,4,5,6,2,3)]
messages
# substring replacement
voice[,1] = gsub("\\+.*", 'Z', voice[,1])
# individual extraction
year = factor(year(as.POSIXlt(voice[,1])))
month = factor(months(as.POSIXlt(voice[,1])),levels = month.name)
day = weekdays(as.POSIXlt(voice[,1]))
# appending new individually extracted dates
voices = cbind(voice, year, month,day)
voices = voices[,c(1,3,4,5,2)]
voices
# substring replacement
communicator[,1] = gsub("\\+.*", 'Z', communicator[,1])
# individual extraction
year = factor(year(as.POSIXlt(communicator[,1])))
month = factor(months(as.POSIXlt(communicator[,1])),levels = month.name)
day = weekdays(as.POSIXlt(communicator[,1]))
communicator
# appending new individually extracted dates
communicators = cbind(communicator, year, month,day)
communicators = communicators[,c(1,4,5,6,2,3)]
communicators$total_communicated = communicators$visitors * communicators$pct_communicated/100
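The cells above repeat the same three steps for every dataset, so they could be wrapped in one function. This is only a sketch: `add_date_parts` is a hypothetical helper that is not in the original notebook, and it assumes the timestamp is always the first column. It uses format() for the year instead of lubridate so that it runs in base R.

```r
# Hypothetical helper, not part of the original notebook.
# Assumes column 1 holds timestamps like "2021-03-27T00:00:00+00:00".
add_date_parts <- function(df) {
  # substring replacement, as in the cells above
  df[, 1] <- gsub("\\+.*", "Z", df[, 1])
  stamp <- as.POSIXlt(df[, 1], format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

  # individual extraction (months() and weekdays() return locale-dependent
  # names; an English locale is assumed for the month.name levels to match)
  parts <- data.frame(
    year  = factor(format(stamp, "%Y")),
    month = factor(months(stamp), levels = month.name),
    day   = weekdays(stamp)
  )

  # timestamp first, then the date parts, then the remaining columns
  cbind(df[, 1, drop = FALSE], parts, df[, -1, drop = FALSE])
}

# Made-up rows for illustration
demo <- data.frame(
  interval_start_timestamp = c("2020-04-01T00:00:00+00:00",
                               "2021-03-27T00:00:00+00:00"),
  new_members = c(3, 5)
)
demo_parts <- add_date_parts(demo)
names(demo_parts)
```

Because the helper puts the new columns right after the timestamp, the per-dataset index vectors such as c(1,5,6,7,2,3,4) would no longer be needed.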
The following modifications are my attempts to identify Covid years for our analysis. I could have edited the CSV, but I decided to explore R to practice ETL for larger datasets. The Fall 2017 STAT 200 course page on Regression With Factor Variables was particularly helpful as a reference when I was trying to control which level R uses as the reference (default) level of the factor. The choice of reference level will be important when I generate the linear models and interpret the outputs: with Normal as the reference level, the models report a year_typeCovid coefficient. I would also recommend reading the Berkeley Statistics page on “Factors in R” to get a deeper understanding of how to convert factors with dates.
I could have applied relevel() directly to the as.factor line, as seen in the Stack Overflow answer How to force R to use a specified factor level as reference in a regression?, but I realized it was much easier to read and run the code in my head line by line than to nest multiple function calls.
# marking covid and non covid months
joins$year_type = as.double(joins$year)
joins$year_type[joins$year_type == 1 ] <- "Normal"
joins$year_type[joins$year_type == 2] <- "Covid"
joins$year_type[joins$year_type == 3] <- "Covid"
joins$year_type = as.factor(joins$year_type)
joins$year_type = relevel(joins$year_type, ref = 2)
joins
leaves$year_type = as.double(leaves$year)
leaves$year_type[leaves$year_type == 1 ] <- "Normal"
leaves$year_type[leaves$year_type ==2] <- "Covid"
leaves$year_type[leaves$year_type ==3] <- "Covid"
leaves$year_type = as.factor(leaves$year_type)
leaves$year_type = relevel(leaves$year_type, ref = 2)
leaves
sources$year_type = as.double(sources$year)
sources$year_type[sources$year_type == 1 ] <- "Normal"
sources$year_type[sources$year_type ==2] <- "Covid"
sources$year_type[sources$year_type ==3] <- "Covid"
sources$year_type = as.factor(sources$year_type)
sources$year_type = relevel(sources$year_type, ref = 2)
sources
messages$year_type = as.double(messages$year)
messages$year_type[messages$year_type == 1 ] <- "Normal"
messages$year_type[messages$year_type ==2] <- "Covid"
messages$year_type[messages$year_type ==3] <- "Covid"
messages$year_type = as.factor(messages$year_type)
messages$year_type = relevel(messages$year_type, ref = 2)
messages
voices$year_type = as.double(voices$year)
voices$year_type[voices$year_type == 1 ] <- "Normal"
voices$year_type[voices$year_type ==2] <- "Covid"
voices$year_type[voices$year_type ==3] <- "Covid"
voices$year_type = as.factor(voices$year_type)
voices$year_type = relevel(voices$year_type, ref = 2)
voices
communicators$year_type = as.double(communicators$year)
communicators$year_type[communicators$year_type == 1 ] <- "Normal"
communicators$year_type[communicators$year_type ==2] <- "Covid"
communicators$year_type[communicators$year_type ==3] <- "Covid"
communicators$year_type = as.factor(communicators$year_type)
communicators$year_type = relevel(communicators$year_type, ref = 2)
communicators
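The recoding above can also be written more compactly. Here is a minimal sketch on a made-up year factor, assuming (as the subsets by year suggest) that 2019 is the Normal year and 2020 and 2021 are the Covid years. It also names the reference level explicitly instead of using its position.

```r
# Made-up year factor with the same three levels as the datasets
year <- factor(c("2019", "2020", "2021", "2020"))

# Label each row, then convert to a factor (levels are alphabetical by default)
year_type <- factor(ifelse(year == "2019", "Normal", "Covid"))
default_levels <- levels(year_type)

# Name the reference level instead of relying on its position
year_type <- relevel(year_type, ref = "Normal")
releveled <- levels(year_type)

default_levels
releveled
```

Passing the level name to ref avoids having to remember that the alphabetical default puts Covid first and Normal second.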
joins
leaves
sources
messages
voices
communicators
text
voice
Originally I planned on aggregating by the year for my bar charts, but when I read through some more examples of aggregates, I found a better method in “Aggregating by category”
joins.2019 = subset(joins, year == 2019)
joins.2020 = subset(joins, year == 2020)
joins.2021 = subset(joins, year == 2021)
leaves.2019 = subset(leaves, year == 2019)
leaves.2020 = subset(leaves, year == 2020)
leaves.2021 = subset(leaves, year == 2021)
sources.2019 = subset(sources, year == 2019)
sources.2020 = subset(sources, year == 2020)
sources.2021 = subset(sources, year == 2021)
comm.2019 = subset(communicators, year == 2019)
comm.2020 = subset(communicators, year == 2020)
comm.2021 = subset(communicators, year == 2021)
joins.2019
leaves.2019
sources.2019
comm.2019
joins.2020
leaves.2020
sources.2020
comm.2020
joins.2021
leaves.2021
sources.2021
comm.2021
joins.2019
leaves.2019
comm.2019
agg_joins.2019 = aggregate(joins.2019$new_members, list(joins.2019$month), sum)
colnames(agg_joins.2019) <- c("Months", "Total New Members")
agg_leaves.2019 = aggregate(leaves.2019$leavers, list(leaves.2019$month), sum)
colnames(agg_leaves.2019) <- c("Months", "Total Leavers")
agg_comm.2019 = aggregate(comm.2019$total_communicated, list(comm.2019$month), sum)
colnames(agg_comm.2019) <- c("Months", "Total Communicated")
# sort each table by its aggregated total
agg_joins.2019[order(agg_joins.2019$`Total New Members`),]
agg_leaves.2019[order(agg_leaves.2019$`Total Leavers`),]
agg_comm.2019[order(agg_comm.2019$`Total Communicated`),]
joins.2020
leaves.2020
comm.2020
agg_joins.2020 = aggregate(joins.2020$new_members, list(joins.2020$month), sum)
colnames(agg_joins.2020) <- c("Months", "Total New Members")
agg_leaves.2020 = aggregate(leaves.2020$leavers, list(leaves.2020$month), sum)
colnames(agg_leaves.2020) <- c("Months", "Total Leavers")
agg_comm.2020 = aggregate(comm.2020$total_communicated, list(comm.2020$month), sum)
colnames(agg_comm.2020) <- c("Months", "Total Communicated")
# sort each table by its aggregated total
agg_joins.2020[order(agg_joins.2020$`Total New Members`),]
agg_leaves.2020[order(agg_leaves.2020$`Total Leavers`),]
agg_comm.2020[order(agg_comm.2020$`Total Communicated`),]
joins.2021
leaves.2021
comm.2021
agg_joins.2021 = aggregate(joins.2021$new_members, list(joins.2021$month), sum)
colnames(agg_joins.2021) <- c("Months", "Total New Members")
agg_leaves.2021 = aggregate(leaves.2021$leavers, list(leaves.2021$month), sum)
colnames(agg_leaves.2021) <- c("Months", "Total Leavers")
agg_comm.2021 = aggregate(comm.2021$total_communicated, list(comm.2021$month), sum)
colnames(agg_comm.2021) <- c("Months", "Total Communicated")
# sort each table by its aggregated total
agg_joins.2021[order(agg_joins.2021$`Total New Members`),]
agg_leaves.2021[order(agg_leaves.2021$`Total Leavers`),]
agg_comm.2021[order(agg_comm.2021$`Total Communicated`),]
communicators
median_comm = aggregate(communicators$visitors, list(communicators$month), sum)
median_comm[order(median_comm$x),]
As mentioned in the subsetting by year section, upon reading some examples for aggregating in R, I found that there was a method to aggregate by multiple columns. The following article “Aggregate in R” was particularly helpful as it had sample code with useful outputs. The second option of using R linear model notation is a bit more intuitive than the first suggestion.
aggregate(df_2$weight, by = list(df_2$feed, df_2$cat_var), FUN = sum)
# Equivalent to:
aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum)
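The article's snippet depends on its own df_2 data, so here is a small runnable version of the same comparison. The `toy` data frame and its values are made up for illustration; the column names mirror the joins dataset.

```r
# Made-up data with the same grouping columns as the joins dataset
toy <- data.frame(
  month = factor(c("January", "January", "February", "January"),
                 levels = month.name),
  year = factor(c("2020", "2020", "2020", "2021")),
  new_members = c(2, 3, 4, 1)
)

# Option 1: pass the value vector and a list of grouping vectors
by_list <- aggregate(toy$new_members,
                     by = list(month = toy$month, year = toy$year),
                     FUN = sum)

# Option 2: formula notation, which keeps the original column names
by_formula <- aggregate(new_members ~ month + year, data = toy, FUN = sum)

by_list
by_formula
```

Both calls return the same totals. The list version names the result column x, while the formula version keeps new_members, which is one reason the formula notation is easier to work with.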
joins
agg_joins = aggregate(new_members ~ month + year, data = joins, FUN = sum)
agg_joins
leaves
agg_leaves = aggregate(leavers ~ month + year, data = leaves, FUN = sum)
agg_leaves
Looks really weird; ignoring for now.
sources
agg_sources = aggregate(discovery_joins + invites + vanity_joins ~ month + year, data = sources, FUN = sum)
agg_sources
communicators
agg_comms = aggregate(total_communicated ~ month + year, data = communicators, FUN = sum)
agg_comms
I realized that R’s base plots were not going to make the cut. When I was searching for graphing solutions on a different project, I found an appealing solution in ggplot. At the time I was using Python, so ggplot wasn’t a supported library. In another class, the professor introduced ggplot2. I could have used Excel to generate the plots, but I wanted a learning opportunity to try ggplot2 on something that wasn’t homework or classwork. I knew I needed a stacked bar chart because I needed to compare the changes across the months and years.
After a search on the web, I found the guide “How to Create and Customize Bar Plot Using ggplot2 Package in R- One Zero Blog” on Towards Data Science (Medium) to be particularly helpful, as it had sample code with outputs. I used the sample code from the section on bar labels for a stacked bar plot as a base and modified it to fit my data.
To make it easier for me to input the parameters, I loaded all the aggregate data, since I wasn’t sure how the graphs would look.
library(ggplot2)
joins
agg_joins.2019
agg_joins.2020
agg_joins.2021
agg_joins
I started by substituting the sample parameters with my own dataset. I quickly realized that the graph had some issues on the x axis. The month names were overlapping.
all_joins = ggplot(data = agg_joins, mapping = aes(x = month, y = new_members, fill = year)) + xlab("Month") + ylab("Total New Members") + geom_col()+
geom_text(aes(label=new_members), position = position_stack(vjust= 0.5),
colour = "white", size = 5)
all_joins = all_joins + labs(title = "New Member Joins Across the Year")
all_joins
After searching the web, I found a great Stack Overflow answer, How to maintain size of ggplot with long labels, that ultimately influenced the final graphs: flipping the axes with coord_flip() gives the month names room on the vertical axis.
all_joins = ggplot(data = agg_joins, mapping = aes(x = month, y = new_members, fill = year)) + xlab("Month") + ylab("Total New Members") + geom_col()+
geom_text(aes(label=new_members), position = position_stack(vjust= 0.5),
colour = "white", size = 5) + coord_flip()
all_joins = all_joins + labs(title = "New Member Joins Across the Year")
all_joins
When I first made the graphs, the months on the axis ran in reverse calendar order. For the presentation I used the version above, but when I came back for the final report and write-up, I decided to search for a solution. I already knew that coord_flip() was the cause of the reversed order. Searching for "ggplot coord_flip() change order of x axis" found the answer I was looking for: Reversed order after coord_flip in R. I learned that I could use the limits parameter to change the order, as calling scale_x_discrete() without any parameters wouldn’t change my graph.
Ultimately this is the final version of the graph. For the report, I scaled the horizontal dimension to be 1920 and had the aspect ratio fixed.
all_joins = ggplot(data = agg_joins, mapping = aes(x = month, y = new_members, fill = year)) + xlab("Month") + ylab("Total New Members") + geom_col()+
geom_text(aes(label=new_members), position = position_stack(vjust= 0.5),
colour = "white", size = 5) + coord_flip() + scale_x_discrete(limits = rev(levels(agg_joins$month)))
all_joins = all_joins + labs(title = "New Member Joins Across the Year")
all_joins
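To isolate the axis-ordering trick, here is a stripped-down sketch with made-up totals. After coord_flip() the first factor level is drawn at the bottom, so the limits are set to the reversed levels to put January at the top. The data frame `month_totals` is illustrative, and the plot is only built if ggplot2 is installed.

```r
# Made-up monthly totals; levels = month.name keeps calendar order
month_totals <- data.frame(
  month = factor(c("January", "February", "March"), levels = month.name),
  new_members = c(4, 7, 2)
)

# Reverse the level order so January ends up at the top after the flip
flipped_limits <- rev(levels(month_totals$month))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  flipped_plot <- ggplot(month_totals, aes(x = month, y = new_members)) +
    geom_col() +
    coord_flip() +
    scale_x_discrete(limits = flipped_limits)
}

flipped_limits
```

Because the limits include all twelve levels, months without data still get a slot on the axis, which matches what the full graphs above do.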
I also made a graph for leaves, but it was ultimately scrapped because our analysis was more focused on the changes in new users. Perhaps we can return to analyze the leaves later.
leaves
agg_leaves.2019
agg_leaves.2020
agg_leaves.2021
agg_leaves
all_leaves = ggplot(data = agg_leaves, mapping = aes(x = month, y = leavers, fill = year)) + xlab("Month") + ylab("Total Leaves") + geom_col()+
geom_text(aes(label=leavers), position = position_stack(vjust= 0.5),
colour = "white", size = 5) + coord_flip() + scale_x_discrete(limits = rev(levels(agg_leaves$month)))
all_leaves = all_leaves + labs(title = "Member Leaves Across the Year")
all_leaves
communicators
agg_comm.2019
agg_comm.2020
agg_comm.2021
agg_comms
all_comms = ggplot(data = agg_comms, mapping = aes(x = month, y = total_communicated, fill = year)) + xlab("Month") + ylab("Total Members Communicated") +
geom_col()+ geom_text(aes(label=total_communicated), position = position_stack(vjust= 0.5),
colour = "white", size = 5) + coord_flip() + scale_x_discrete(limits = rev(levels(agg_comms$month)))
all_comms = all_comms + labs(title = "All Communicating Members")
all_comms
This section contains the code for generating linear models for the other variables we were interested in. I followed my professor’s notes for setting up the parameters. For fun I decided to experiment with the messages dataset, as it included an additional variable of messages_per_communicator which gives a bit more granularity in comparing between individuals and aggregates for messages.
joins
joins_lm = lm(new_members ~ month + year_type, data = joins)
print(summary(joins_lm))
Call:
lm(formula = new_members ~ month + year_type, data = joins)
Residuals:
Min 1Q Median 3Q Max
-8.759 -2.195 -0.612 0.808 85.469
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.98132 0.80555 2.460 0.01414 *
monthFebruary 0.26401 0.96935 0.272 0.78543
monthMarch -0.01493 0.95690 -0.016 0.98756
monthApril 0.60450 0.98228 0.615 0.53848
monthMay -0.36969 0.97461 -0.379 0.70456
monthJune -0.46217 0.98228 -0.471 0.63814
monthJuly -0.30518 0.97461 -0.313 0.75428
monthAugust 6.54966 0.97461 6.720 3.70e-11 ***
monthSeptember 4.28783 0.98228 4.365 1.46e-05 ***
monthOctober 2.22708 0.97461 2.285 0.02260 *
monthNovember 2.25450 0.98228 2.295 0.02201 *
monthDecember -0.78905 0.97461 -0.810 0.41844
year_typeCovid 1.22836 0.44590 2.755 0.00602 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.283 on 716 degrees of freedom
Multiple R-squared: 0.1475, Adjusted R-squared: 0.1332
F-statistic: 10.32 on 12 and 716 DF, p-value: < 2.2e-16
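To make the role of the reference level concrete, here is a tiny sketch with made-up numbers. With Normal as the reference, the intercept is the Normal mean and year_typeCovid is the difference between the Covid mean and the Normal mean. The `toy` data frame is illustrative only, and the month term is left out to keep the arithmetic easy to check by hand.

```r
# Made-up data: Normal mean is 2, Covid mean is 5
toy <- data.frame(
  year_type = factor(c("Normal", "Normal", "Covid", "Covid"),
                     levels = c("Normal", "Covid")),
  new_members = c(1, 3, 4, 6)
)

# Normal is the reference level, so the model reports year_typeCovid
fit_normal_ref <- lm(new_members ~ year_type, data = toy)
coef(fit_normal_ref)

# Switching the reference flips the sign and the name of the coefficient
toy$year_type <- relevel(toy$year_type, ref = "Covid")
fit_covid_ref <- lm(new_members ~ year_type, data = toy)
coef(fit_covid_ref)
```

Read this way, the year_typeCovid estimate in the summary above is the estimated change relative to the Normal year, holding the month fixed.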
messages
messages_lm = lm(messages ~ month + year_type, data = messages)
print(summary(messages_lm))
Call:
lm(formula = messages ~ month + year_type, data = messages)
Residuals:
Min 1Q Median 3Q Max
-533.72 -131.98 -34.98 68.19 2435.80
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 370.7838 37.3808 9.919 < 2e-16 ***
monthFebruary 0.7405 44.9820 0.016 0.98687
monthMarch 19.0476 44.4043 0.429 0.66808
monthApril 153.6371 45.5819 3.371 0.00079 ***
monthMay 24.6162 45.2261 0.544 0.58641
monthJune -73.9795 45.5819 -1.623 0.10503
monthJuly -42.4322 45.2261 -0.938 0.34845
monthAugust 210.2452 45.2261 4.649 3.98e-06 ***
monthSeptember 433.9371 45.5819 9.520 < 2e-16 ***
monthOctober 261.9549 45.2261 5.792 1.04e-08 ***
monthNovember 109.9371 45.5819 2.412 0.01612 *
monthDecember -79.8354 45.2261 -1.765 0.07795 .
year_typeCovid -193.5419 20.6915 -9.354 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 245.1 on 716 degrees of freedom
Multiple R-squared: 0.369, Adjusted R-squared: 0.3584
F-statistic: 34.89 on 12 and 716 DF, p-value: < 2.2e-16
messages
messages_lm1 = lm(messages ~ month + year_type + messages_per_communicator, data = messages)
print(summary(messages_lm1))
Call:
lm(formula = messages ~ month + year_type + messages_per_communicator,
data = messages)
Residuals:
Min 1Q Median 3Q Max
-794.57 -58.66 1.20 50.09 1112.68
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80.219 22.265 -3.603 0.000337 ***
monthFebruary 44.936 23.694 1.896 0.058298 .
monthMarch 13.590 23.369 0.582 0.561041
monthApril 12.429 24.209 0.513 0.607821
monthMay -37.842 23.845 -1.587 0.112952
monthJune -2.577 24.045 -0.107 0.914678
monthJuly -33.459 23.802 -1.406 0.160241
monthAugust 128.790 23.875 5.394 9.36e-08 ***
monthSeptember 311.849 24.154 12.911 < 2e-16 ***
monthOctober 187.593 23.863 7.861 1.40e-14 ***
monthNovember 101.338 23.989 4.224 2.71e-05 ***
monthDecember -12.940 23.851 -0.543 0.587613
year_typeCovid -36.598 11.478 -3.189 0.001492 **
messages_per_communicator 55.895 1.292 43.247 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 129 on 715 degrees of freedom
Multiple R-squared: 0.8255, Adjusted R-squared: 0.8223
F-statistic: 260.2 on 13 and 715 DF, p-value: < 2.2e-16
messages
messages_lm2 = lm(messages_per_communicator ~ month + year_type, data = messages)
print(summary(messages_lm2))
Call:
lm(formula = messages_per_communicator ~ month + year_type, data = messages)
Residuals:
Min 1Q Median 3Q Max
-7.5431 -2.2972 -0.7784 1.2309 28.5756
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.06881 0.56882 14.185 < 2e-16 ***
monthFebruary -0.79070 0.68449 -1.155 0.24841
monthMarch 0.09763 0.67570 0.144 0.88515
monthApril 2.52633 0.69362 3.642 0.00029 ***
monthMay 1.11743 0.68821 1.624 0.10489
monthJune -1.27745 0.69362 -1.842 0.06593 .
monthJuly -0.16054 0.68821 -0.233 0.81561
monthAugust 1.45731 0.68821 2.118 0.03456 *
monthSeptember 2.18426 0.69362 3.149 0.00171 **
monthOctober 1.33040 0.68821 1.933 0.05361 .
monthNovember 0.15385 0.69362 0.222 0.82452
monthDecember -1.19681 0.68821 -1.739 0.08246 .
year_typeCovid -2.80785 0.31486 -8.918 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.73 on 716 degrees of freedom
Multiple R-squared: 0.2164, Adjusted R-squared: 0.2033
F-statistic: 16.48 on 12 and 716 DF, p-value: < 2.2e-16
voices
voices_lm = lm(speaking_minutes ~ month + year_type, data = voices)
print(summary(voices_lm))
Call:
lm(formula = speaking_minutes ~ month + year_type, data = voices)
Residuals:
Min 1Q Median 3Q Max
-928.94 -287.96 -21.33 150.04 2268.59
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 238.42 68.62 3.475 0.000542 ***
monthFebruary 53.85 82.57 0.652 0.514493
monthMarch 261.27 81.51 3.205 0.001409 **
monthApril -217.09 83.67 -2.595 0.009665 **
monthMay -269.06 83.02 -3.241 0.001246 **
monthJune -225.25 83.67 -2.692 0.007265 **
monthJuly -265.07 83.02 -3.193 0.001470 **
monthAugust 142.77 83.02 1.720 0.085914 .
monthSeptember 474.25 83.67 5.668 2.09e-08 ***
monthOctober 463.99 83.02 5.589 3.25e-08 ***
monthNovember 256.21 83.67 3.062 0.002280 **
monthDecember -9.41 83.02 -0.113 0.909785
year_typeCovid 216.28 37.98 5.694 1.81e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 450 on 716 degrees of freedom
Multiple R-squared: 0.2877, Adjusted R-squared: 0.2757
F-statistic: 24.1 on 12 and 716 DF, p-value: < 2.2e-16
communicators
communicators_lm = lm(total_communicated ~ month + year_type, data = communicators)
print(summary(communicators_lm))
Call:
lm(formula = total_communicated ~ month + year_type, data = communicators)
Residuals:
Min 1Q Median 3Q Max
-39.805 -7.258 -1.258 5.628 77.195
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.1266 1.9689 21.396 < 2e-16 ***
monthFebruary 4.9875 2.3693 2.105 0.03563 *
monthMarch 2.7318 2.3388 1.168 0.24318
monthApril 7.5910 2.4009 3.162 0.00163 **
monthMay -0.3536 2.3821 -0.148 0.88203
monthJune -0.4757 2.4009 -0.198 0.84300
monthJuly -1.8698 2.3821 -0.785 0.43277
monthAugust 20.6786 2.3821 8.681 < 2e-16 ***
monthSeptember 41.6910 2.4009 17.365 < 2e-16 ***
monthOctober 23.1141 2.3821 9.703 < 2e-16 ***
monthNovember 12.7410 2.4009 5.307 1.49e-07 ***
monthDecember -4.6601 2.3821 -1.956 0.05082 .
year_typeCovid -8.8685 1.0899 -8.137 1.79e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.91 on 716 degrees of freedom
Multiple R-squared: 0.552, Adjusted R-squared: 0.5445
F-statistic: 73.53 on 12 and 716 DF, p-value: < 2.2e-16
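# dataframe_name[with(dataframe_name, order(column_name)), ]
# The column name must be unquoted: order("communicators") sorts a single
# string and returns only the first row. speaking_minutes is the measure
# column of voice, so it is used here.
df = voice[with(voice, order(speaking_minutes)), ]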
df
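Here is a minimal sketch of the ordering template on a made-up data frame, showing the difference between passing an unquoted column name and a quoted string to order(). The `toy` data frame and its channel names are illustrative.

```r
# Made-up data for illustration
toy <- data.frame(
  channel = c("a", "b", "c"),
  speaking_minutes = c(30, 10, 20)
)

# Unquoted column name: rows are sorted by the column's values
sorted <- toy[with(toy, order(speaking_minutes)), ]

# Quoted name: order() sees a one-element character vector and returns 1
quoted_index <- order("speaking_minutes")

sorted
quoted_index
```

Adding decreasing = TRUE to order() gives the largest values first, which is usually what a "most popular" ranking needs.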